Support group query attention in Attention(23) CUDA #27082
Conversation
@copilot Can you enable the following three tests? Just turn …
@copilot …
Removed GQA test filters from …. GQA tests will now run on CUDA.
Please fix the CI: https://github.com/microsoft/onnxruntime/actions/runs/21189348181/job/60952159971?pr=27082 The error message: …
Fixed CI linker error in commit 213a82d. The issue was that the GQA …
@copilot …
Reverted float support for GQA in commit f79c509, following the GroupQueryAttention pattern: …
    // Check if this is Group Query Attention (GQA)
    const bool is_gqa = parameters.kv_num_heads != parameters.q_num_heads;

    if (is_gqa) {
TODO: Currently, we do not support 4D inputs of QKV.
Supporting it would require kernel changes in FlashAttention and EfficientAttention. If we want to support 4D, the best approach would be another CUDA kernel that transposes/reshapes the input from 4D to 3D before feeding it to those two attention kernels.
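As a rough illustration only (this is not code from the PR), a kernel along these lines could repack a 4D BNSH tensor (batch, num_heads, seq_len, head_size) into the packed 3D layout (batch, seq_len, num_heads * head_size) that the fused attention kernels consume; the name, template, and launch configuration are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch, not code from this PR: repack a 4D BNSH tensor
// (batch, num_heads, seq_len, head_size) into the packed 3D layout
// (batch, seq_len, num_heads * head_size) expected by the fused kernels.
template <typename T>
__global__ void TransposeBNSHToBSNH(const T* input, T* output,
                                    int num_heads, int seq_len, int head_size) {
  const int batch = blockIdx.y;  // grid.y indexes the batch dimension
  const long long elems_per_batch =
      static_cast<long long>(num_heads) * seq_len * head_size;
  const long long idx =
      static_cast<long long>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx >= elems_per_batch) return;

  // Decompose the flat offset within one batch in BNSH order: (n, s, h).
  const int h = static_cast<int>(idx % head_size);
  const int s = static_cast<int>((idx / head_size) % seq_len);
  const int n = static_cast<int>(idx / (static_cast<long long>(head_size) * seq_len));

  // Re-linearize in BSNH order: (s, n, h).
  const long long dst =
      batch * elems_per_batch +
      (static_cast<long long>(s) * num_heads + n) * head_size + h;
  output[dst] = input[batch * elems_per_batch + idx];
}
// Possible launch: dim3 grid((elems_per_batch + 255) / 256, batch_size);
// TransposeBNSHToBSNH<<<grid, 256, 0, stream>>>(in, out, N, S, H);
```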
This pull request introduces improvements and bug fixes to the attention mechanism in ONNX Runtime, particularly focusing on the handling of attention masks and the computation of attention probabilities for both CPU and CUDA providers. The most significant changes include the addition of a new CUDA implementation for converting boolean attention masks to sequence lengths with validation, and several bug fixes in the CPU attention kernel to correctly handle head indices during computation.
CUDA Attention Mask Conversion and Validation:
Added a new CUDA implementation (attention_mask_impl.cu and attention_mask_impl.h) that efficiently converts a boolean attention mask to sequence lengths for the GQA (Grouped Query Attention) kernels; this includes validation of the mask.
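For context, the conversion amounts to counting the unmasked positions in each batch row. The kernel below is only a sketch of that idea, assuming a 2D, right-padded boolean mask and a fixed block size; it is not the code in attention_mask_impl.cu and omits the validation the real implementation performs.

```cuda
#include <cuda_runtime.h>

// Illustrative sketch only; the real logic lives in attention_mask_impl.cu.
// Assumes a 2D boolean mask of shape [batch_size, total_seq_len] that is
// right-padded (1s followed by 0s) and a block size of 256 threads.
__global__ void MaskToSeqLens(const bool* mask, int* seqlens, int total_seq_len) {
  const int batch = blockIdx.x;  // one block per batch row
  const bool* row = mask + static_cast<size_t>(batch) * total_seq_len;

  // Each thread counts a strided slice of the row.
  int count = 0;
  for (int i = threadIdx.x; i < total_seq_len; i += blockDim.x) {
    count += row[i] ? 1 : 0;
  }

  // Block-wide reduction of the partial counts in shared memory.
  __shared__ int partial[256];
  partial[threadIdx.x] = count;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
      partial[threadIdx.x] += partial[threadIdx.x + stride];
    }
    __syncthreads();
  }

  if (threadIdx.x == 0) {
    seqlens[batch] = partial[0];  // number of unmasked tokens in this row
  }
}
// Possible launch: MaskToSeqLens<<<batch_size, 256, 0, stream>>>(mask, seqlens, total_seq_len);
```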
CPU Attention Kernel Bug Fixes:
Fixed head indexing in the CPU attention kernel (attention.cc) by replacing incorrect uses of (head_i % parameters.kv_num_heads) and head_i with the correct head_ki and head_vi indices when accessing the K and V matrices. This ensures correct head alignment, especially in multi-head or grouped attention scenarios. [1] [2] [3] [4]
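The fix comes down to how a query head maps to its KV head under GQA. The snippet below is a standalone illustration, assuming the conventional grouping where consecutive query heads share one KV head; it is not the attention.cc code itself, and the exact computation of head_ki / head_vi there may differ.

```cpp
#include <cstdio>

// Standalone illustration of the GQA head mapping (assumed conventional
// grouping: consecutive query heads share one KV head). Not the actual
// attention.cc code, which the PR fixes to use head_ki / head_vi.
int main() {
  const int q_num_heads = 8;
  const int kv_num_heads = 2;
  const int group_size = q_num_heads / kv_num_heads;  // query heads per KV head

  for (int head_i = 0; head_i < q_num_heads; ++head_i) {
    const int head_ki = head_i / group_size;   // intended KV head for K (and V)
    const int buggy = head_i % kv_num_heads;   // old expression: interleaves heads
    std::printf("q head %d -> kv head %d (old expression gave %d)\n",
                head_i, head_ki, buggy);
  }
  return 0;
}
```

With q_num_heads = 8 and kv_num_heads = 2, query heads 0-3 should read KV head 0 and heads 4-7 should read KV head 1; the old modulo expression interleaved them instead, which is exactly the misalignment the fix addresses.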
NOT supported in this PR (see the sketch after this list):
Cross attention (q_sequence_length != kv_sequence_length)
4D QKV (BNSH format)
is_causal=False
fp32
Softmax precision
qk_output_mode
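The constraints above suggest the kind of early-exit validation the kernel might perform. The sketch below covers only a subset of the list and is purely illustrative: the struct, field names, and CheckSupported helper are hypothetical and are not ONNX Runtime APIs.

```cpp
#include <string>

// Hypothetical guard sketch; the struct, fields, and helper are illustrative
// and are not ONNX Runtime APIs. It mirrors some of the unsupported cases above.
struct AttentionParams {
  int q_sequence_length = 0;
  int kv_sequence_length = 0;
  int qkv_rank = 3;        // 3 = packed BSH-style input, 4 = BNSH
  bool is_causal = true;
  bool is_fp32 = false;
};

// Returns an empty string when the configuration is supported, otherwise the reason.
std::string CheckSupported(const AttentionParams& p) {
  if (p.q_sequence_length != p.kv_sequence_length)
    return "cross attention (q_sequence_length != kv_sequence_length) is not supported";
  if (p.qkv_rank == 4)
    return "4D QKV (BNSH format) is not supported";
  if (!p.is_causal)
    return "is_causal=False is not supported";
  if (p.is_fp32)
    return "fp32 is not supported";
  return "";
}
```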